Matriculation number: 30496A
The primary aim of this project is to conduct a comprehensive analysis of IMDb scores and ratings, age certification ratings, and the evolution of content quality over time for TV shows and movies on Netflix. This multifaceted approach serves several key objectives.
The dataset used in this analysis was obtained from Back 2 Viz Basics, which provides comprehensive information about various titles available on the popular streaming platform. The dataset includes details such as the title’s name, its type (whether it is a TV show or a movie), a brief description of the content, the year it was released, age certification rating, runtime (for TV shows: length of episodes; for movies: duration), IMDb score, and IMDb votes. The dataset is publicly available and was downloaded for the purpose of this project.
The dataset consists of 5,283 rows and 11 columns.
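The dataset includes the following variables: index, id, title, type, description, release_year, age_certification, runtime, imdb_id, imdb_score, and imdb_votes.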
The data was provided in CSV format and was imported into R for analysis. Prior to analysis, some data cleaning and preprocessing steps were performed to handle missing values and ensure data consistency.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(htmlwidgets)
library(knitr)
netflix_data <- read.csv("D:/study/coding for data science/Netflix_Imdb_Project/Netflix_IMDB_Scores/Data/processed_data/cleaned_netflix_data.csv")
head(netflix_data)
## index id title type
## 1 0 tm84618 Taxi Driver MOVIE
## 2 1 tm127384 Monty Python and the Holy Grail MOVIE
## 3 2 tm70993 Life of Brian MOVIE
## 4 3 tm190788 The Exorcist MOVIE
## 5 4 ts22164 Monty Python's Flying Circus SHOW
## 6 5 tm14873 Dirty Harry MOVIE
## description
## 1 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 2 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 3 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 4 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 5 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## 6 When a madman dubbed 'Scorpio' terrorizes San Francisco, hard-nosed cop, Harry Callahan – famous for his take-no-prisoners approach to law enforcement – is tasked with hunting down the psychopath. Harry eventually collars Scorpio in the process of rescuing a kidnap victim, only to see him walk on technicalities. Now, the maverick detective is determined to nail the maniac himself.
## release_year age_certification runtime imdb_id imdb_score imdb_votes
## 1 1976 R 113 tt0075314 8.3 795222
## 2 1975 PG 91 tt0071853 8.2 530877
## 3 1979 R 94 tt0079470 8.0 392419
## 4 1973 R 133 tt0070047 8.1 391942
## 5 1969 TV-14 30 tt0063929 8.8 72895
## 6 1971 R 102 tt0066999 7.7 153463
knitr::kable(summary(netflix_data))
| index | id | title | type | description | release_year | age_certification | runtime | imdb_id | imdb_score | imdb_votes |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0 | Length:5283 | Length:5283 | Length:5283 | Length:5283 | Min. :1953 | Length:5283 | Min. : 0.0 | Length:5283 | Min. :1.500 | Length:5283 |
| 1st Qu.:1320 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:2015 | Class :character | 1st Qu.: 45.0 | Class :character | 1st Qu.:5.800 | Class :character |
| Median :2641 | Mode :character | Mode :character | Mode :character | Mode :character | Median :2018 | Mode :character | Median : 87.0 | Mode :character | Median :6.600 | Mode :character |
| Mean :2641 | NA | NA | NA | NA | Mean :2016 | NA | Mean : 79.2 | NA | Mean :6.533 | NA |
| 3rd Qu.:3962 | NA | NA | NA | NA | 3rd Qu.:2020 | NA | 3rd Qu.:106.0 | NA | 3rd Qu.:7.400 | NA |
| Max. :5282 | NA | NA | NA | NA | Max. :2022 | NA | Max. :235.0 | NA | Max. :9.600 | NA |
str(netflix_data)
## 'data.frame': 5283 obs. of 11 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : chr "tm84618" "tm127384" "tm70993" "tm190788" ...
## $ title : chr "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
## $ type : chr "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
## $ release_year : int 1976 1975 1979 1973 1969 1971 1964 1980 1967 1966 ...
## $ age_certification: chr "R" "PG" "R" "R" ...
## $ runtime : int 113 91 94 133 30 102 170 104 110 117 ...
## $ imdb_id : chr "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
## $ imdb_score : num 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
## $ imdb_votes : chr "795222" "530877" "392419" "391942" ...
knitr::kable(colSums(is.na(netflix_data)))
| Variable | Missing values |
|---|---|
| index | 0 |
| id | 0 |
| title | 0 |
| type | 0 |
| description | 0 |
| release_year | 0 |
| age_certification | 0 |
| runtime | 0 |
| imdb_id | 0 |
| imdb_score | 0 |
| imdb_votes | 0 |
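One caveat about the table above: `str()` reports `imdb_votes` as a character column, and `is.na()` on a character vector does not flag strings that merely fail to parse as numbers. The following is a minimal sketch of that effect, using a small invented vector (`votes_chr` is hypothetical, not the real column). It may explain why a column with zero reported missing values still loses observations once it is converted to numeric.

```r
# Hypothetical stand-in for a character column such as imdb_votes;
# the values are invented for illustration only.
votes_chr <- c("795222", "530877", "", "n/a", "72895")

# On a character vector, is.na() only flags true NA entries,
# not empty or non-numeric strings.
na_before <- sum(is.na(votes_chr))

# Converting to numeric turns unparseable strings into NA
# (R emits a coercion warning, suppressed here).
votes_num <- suppressWarnings(as.numeric(votes_chr))
na_after <- sum(is.na(votes_num))

print(c(before = na_before, after = na_after))
```

Running the missing-value check again after the conversion step would give a more faithful count for numeric analyses.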
fig_1a - Histogram of IMDb Scores: This
visualization separates the IMDb scores by content type (movies or
shows) and displays their distribution. It is a direct approach to
analyzing audience rating trends and can show if certain scores are more
common for either type of content.
fig_1a <- ggplot(netflix_data, aes(x = imdb_score, fill = type)) +
geom_histogram(binwidth = 0.5, position = "dodge", alpha = 0.7) +
labs(title = "Distribution of IMDb Scores on Netflix",
x = "IMDb Scores",
y = "Number of Titles") +
theme_minimal()
fig_1a
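The counts behind a dodged histogram like fig_1a can also be inspected as a table. The sketch below uses base R on an invented data frame (`toy` and its values are hypothetical) and assumes bins of width 0.5 starting at 0. The bin edges that ggplot2 chooses may be aligned differently, so the numbers need not match the plot exactly.

```r
# Hypothetical scores and types, invented for illustration.
toy <- data.frame(
  imdb_score = c(5.8, 6.2, 6.6, 7.4, 8.3, 6.1, 7.9, 8.8),
  type = c("MOVIE", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW")
)

# Bins of width 0.5 covering the 0-10 IMDb scale (assumed alignment).
breaks <- seq(0, 10, by = 0.5)
toy$bin <- cut(toy$imdb_score, breaks = breaks, include.lowest = TRUE)

# Count titles per bin and type; each cell corresponds to one bar.
bin_counts <- table(toy$bin, toy$type)

# Show only the bins that contain at least one title.
print(bin_counts[rowSums(bin_counts) > 0, ])
```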
fig_1b - Pie Chart of Content Types:
This chart gives a quick visual representation of the proportion of
movies to TV shows in the dataset, although it doesn’t directly relate
to IMDb scores.
type_counts <- table(netflix_data$type)
fig_1b <- plot_ly(
labels = names(type_counts),
values = as.vector(type_counts),
type = "pie"
) %>%
layout(title = "Distribution of Types")
fig_1b
fig_1c - Scatter Plot of IMDb Scores
vs. Runtime: This plot can help to identify if there’s a
relationship between the length of content and its IMDb rating, which
can be a factor in content investment decisions.
fig_1c <- ggplot(netflix_data, aes(x = runtime, y = imdb_score)) +
geom_point() +
labs(title = "IMDb Scores vs. Runtime",
x = "Runtime",
y = "IMDb Score") +
theme_minimal()
fig_1c
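A scatter plot suggests a relationship visually; a correlation coefficient puts a number on it. The sketch below is one possible way to do that, on invented data (`toy` is hypothetical), and is not part of the original analysis.

```r
# Hypothetical runtimes (minutes) and scores, invented for illustration.
toy <- data.frame(
  runtime = c(30, 45, 91, 94, 102, 113, 133, 170),
  imdb_score = c(8.8, 7.1, 8.2, 8.0, 7.7, 8.3, 8.1, 7.8)
)

# Pearson measures linear association; Spearman works on ranks and is
# less sensitive to outliers and skewed values.
pearson <- cor(toy$runtime, toy$imdb_score, method = "pearson")
spearman <- cor(toy$runtime, toy$imdb_score, method = "spearman")

print(round(c(pearson = pearson, spearman = spearman), 3))
```

Because runtime means episode length for shows and total duration for movies, computing the coefficient separately for each type is likely to be more informative than pooling them.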
fig_1d - Average IMDb Score by Type Bar
Chart: This shows whether movies or TV shows have higher
average ratings, which is directly useful for content creators and
producers.
fig_1d <- netflix_data %>%
group_by(type) %>%
summarise(average_score = mean(imdb_score, na.rm = TRUE)) %>%
ggplot(aes(x = type, y = average_score, fill = type)) +
geom_bar(stat = "identity") +
labs(title = "Average IMDb Score by Type", x = "Type", y = "Average IMDb Score")
fig_1d
fig_1e - Boxplot of IMDb Scores by
Type: This provides insights into the spread and distribution
of ratings for movies and shows, including median scores and
outliers.
fig_1e <- plot_ly(netflix_data, x = ~type, y = ~imdb_score, type = 'box'
) %>%
layout(title = 'Boxplot of IMDb Scores by Type',
xaxis = list(title = 'Type'),
yaxis = list(title = 'IMDb Score'))
fig_1e
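The quantities a boxplot draws can be computed directly. The sketch below uses the common 1.5 x IQR rule on an invented vector (`scores` is hypothetical). Plotly may compute quartiles with a different interpolation method than R's default, so small differences from the figure are possible.

```r
# Hypothetical IMDb scores, invented for illustration.
scores <- c(1.5, 5.8, 6.2, 6.6, 6.9, 7.4, 7.7, 8.3)

# Quartiles using R's default quantile method (type 7).
q <- unname(quantile(scores, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Fences: points beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- scores[scores < lower_fence | scores > upper_fence]

print(c(q1 = q[1], median = q[2], q3 = q[3]))
print(outliers)
```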
fig_1f - Treemap of Top IMDb Scores:
This highlights the 16 most-voted titles whose IMDb scores fall within a
chosen margin (0.3) of a threshold score (7.1). It shows which widely
rated titles sit in that score range, but the selection is somewhat
arbitrary because it depends on the choice of threshold and margin.
threshold_score <- 7.1
margin_of_error <- 0.3
suppressWarnings({
netflix_data$imdb_votes <- as.numeric(as.character(netflix_data$imdb_votes))
netflix_data$imdb_score <- as.numeric(as.character(netflix_data$imdb_score))
})
top_scores <- netflix_data %>%
filter(between(imdb_score, threshold_score - margin_of_error, threshold_score + margin_of_error)) %>%
arrange(desc(imdb_votes), desc(imdb_score)) %>%
head(16) %>%
mutate(hover_text = paste("Year:", release_year,
"<br>Type:", type,
"<br>IMDb Score:", imdb_score,
"<br>Votes:", format(imdb_votes, big.mark = ",")))
# Create the treemap plot
fig_1f <- plot_ly(
data = top_scores,
labels = ~title,
parents = rep("", nrow(top_scores)),
type = 'treemap',
hoverinfo = 'text',
hovertemplate = '%{customdata}<extra></extra>',
customdata = ~hover_text,
textfont = list(size = 16)
) %>%
layout(
    title = "Most-Voted Titles within IMDb Score Range",
margin = list(l = 10, r = 10, b = 10, t = 40)
)
fig_1f
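One detail of the range filter deserves care. In double-precision arithmetic `7.1 + 0.3` evaluates to a value slightly below `7.4`, so a title scored exactly 7.4 would fail an inclusive upper-bound test against the computed limit. The sketch below illustrates this with invented scores and one possible workaround, a small tolerance. The `tol` value and the `scores` vector are assumptions for illustration, not part of the original code.

```r
threshold_score <- 7.1
margin_of_error <- 0.3
lower <- threshold_score - margin_of_error
upper <- threshold_score + margin_of_error

# Hypothetical scores around the intended range limits.
scores <- c(6.7, 6.8, 7.1, 7.4, 7.5)

# Plain inclusive comparison, the same test dplyr::between() applies.
strict <- scores >= lower & scores <= upper

# Same range widened by a small tolerance to absorb rounding error.
tol <- 1e-8
tolerant <- scores >= lower - tol & scores <= upper + tol

print(upper == 7.4)  # compares the computed limit with the literal 7.4
print(data.frame(scores, strict, tolerant))
```

An alternative with the same effect is to round both the scores and the limits to one decimal place before comparing.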
fig_2a - Target Audience Analysis: This
visualization provides a comparative analysis of the number of Netflix
titles across different age certifications, separated by content type
(Movie or Show). By utilizing a bar chart, the graph distinctly
illustrates the count of titles for each age certification category,
with a clear differentiation between movies and shows.
audience_target_analysis <- netflix_data %>%
group_by(age_certification, type) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Visualize the number of titles targeting each age certification, split by content type
fig_2a <- ggplot(audience_target_analysis, aes(x = age_certification, y = count, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Target Audience Analysis by Age Certification and Content Type",
x = "Age Certification", y = "Number of Titles",
fill = "Content Type") +
theme_minimal()
fig_2a
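Raw counts can be hard to compare when one content type has far more titles than the other. As a complement to the bar chart, the sketch below builds the same cross-tabulation in base R on invented data (`toy` is hypothetical) and converts it to within-type shares.

```r
# Hypothetical certifications and types, invented for illustration.
toy <- data.frame(
  age_certification = c("R", "PG", "R", "R", "TV-14", "TV-MA", "TV-MA", "PG"),
  type = c("MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW", "MOVIE")
)

# Number of titles per certification and content type.
counts <- table(toy$age_certification, toy$type)

# Share of each certification within a content type (columns sum to 1).
shares <- prop.table(counts, margin = 2)

print(counts)
print(round(shares, 2))
```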
fig_2b - IMDb Scores by Age
Certification: This visualization shows the distribution of IMDb
scores within each age certification category for the content
available on Netflix, using one boxplot per category.
fig_2b <- ggplot(netflix_data, aes(x = age_certification, y = imdb_score, fill = age_certification)) +
geom_boxplot() +
labs(title = "IMDb Scores by Age Certification", x = "Age Certification", y = "IMDb Score")
fig_2b
fig_3a - Trends of IMDb Scores Over
Years: This visualization shows the average IMDb score for
titles released each year. By plotting a line graph, you can see trends
in how the average quality of content, as rated by IMDb users, has
changed over time.
fig_3a <- netflix_data %>%
group_by(release_year) %>%
summarise(
average_score = mean(imdb_score, na.rm = TRUE),
se = sd(imdb_score, na.rm = TRUE) / sqrt(n()) # Standard error
) %>%
ggplot(aes(x = release_year, y = average_score)) +
geom_ribbon(aes(ymin = average_score - se, ymax = average_score + se), alpha = 0.2) +
geom_line() +
labs(title = "Trends of IMDb Scores Over Years", x = "Release Year", y = "Average IMDb Score") +
theme_minimal()
fig_3a
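Two properties of the standard-error calculation are worth keeping in mind. In the code above, `n()` counts every row in a year, including any with a missing score, while `sd()` with `na.rm = TRUE` ignores them. Separately, a year with a single title has no defined standard deviation, which leaves a gap in the ribbon. The missing-value table reports no missing `imdb_score` values, so the first point should not affect this dataset. The sketch below shows a variant that counts only non-missing scores, on invented data (`toy` is hypothetical).

```r
# Hypothetical yearly scores, invented for illustration;
# one score is missing and one year has a single title.
toy <- data.frame(
  release_year = c(2018, 2018, 2018, 2019, 2019, 2020),
  imdb_score = c(6.5, 7.0, NA, 5.8, 7.4, 8.1)
)

# Summarise each year using only the non-missing scores.
by_year <- do.call(rbind, lapply(split(toy, toy$release_year), function(d) {
  x <- d$imdb_score[!is.na(d$imdb_score)]
  data.frame(
    release_year = d$release_year[1],
    n = length(x),
    average_score = mean(x),
    # Standard error is undefined for a single observation.
    se = if (length(x) > 1) sd(x) / sqrt(length(x)) else NA_real_
  )
}))

print(by_year)
```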
fig_3b - IMDb Scores vs. IMDb Votes Scatter
Plot: This scatter plot compares the IMDb scores with the
number of votes they’ve received, which might indicate the popularity
and audience engagement with titles of different quality. Using a
logarithmic scale for votes takes into account the wide range of vote
counts.
fig_3b <- plot_ly(netflix_data, x = ~imdb_votes, y = ~imdb_score, type = 'scatter', mode = 'markers',
marker = list(size = 5, opacity = 0.5)) %>%
layout(title = 'IMDb Scores vs. IMDb Votes',
xaxis = list(title = 'IMDb Votes', type = "log"), # Using log scale for X-axis
yaxis = list(title = 'IMDb Score'))
fig_3b
## Warning: Ignoring 16 observations
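The warning above presumably refers to titles whose vote count is missing after the numeric conversion, since such points cannot be placed on the plot. The sketch below, on an invented vector (`votes` is hypothetical), shows how a log transform compresses a wide range of vote counts and how to count the values that cannot be drawn on a logarithmic axis.

```r
# Hypothetical vote counts, invented for illustration;
# includes one missing value and one zero.
votes <- c(5, 120, 72895, 795222, NA, 0)

# log10 compresses values spanning several orders of magnitude.
log_votes <- log10(votes)

# NA stays NA and log10(0) is -Inf; neither has a position on a log axis.
drawable <- is.finite(log_votes)

print(round(log_votes, 2))
print(sum(!drawable))
```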
fig_3c - IMDb Scores Over the Years by
Type: By grouping the data by content type (movies or shows)
and plotting IMDb scores over time, you can compare how movies and shows
might differ in terms of quality trends. This could reveal if there are
distinct patterns in audience perception based on the type of
content.
fig_3c <- ggplot(netflix_data, aes(x = release_year, y = imdb_score, group = type, color = type)) +
geom_line(aes(group = interaction(release_year, type)), alpha = 0.3) + # Use interaction for grouping in geom_line
geom_smooth(se = FALSE, method = "loess" , formula = y ~ x) + # Add LOESS smoothed trend line
labs(title = "IMDb Scores Over the Years by Type", x = "Release Year", y = "IMDb Score") +
theme_minimal()
fig_3c
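The smoothed lines in a plot like fig_3c can be reproduced outside ggplot2 by fitting one LOESS curve per content type. The sketch below does this with `stats::loess()` on simulated data (`toy` and its values are invented). It relies on the default span of `loess()`; the exact curve depends on that setting.

```r
set.seed(1)

# Simulated scores for two content types over 20 years,
# invented for illustration only.
toy <- data.frame(
  release_year = rep(2001:2020, times = 2),
  type = rep(c("MOVIE", "SHOW"), each = 20)
)
toy$imdb_score <- 6.5 + 0.5 * (toy$type == "SHOW") +
  rnorm(nrow(toy), sd = 0.4)

# Fit a separate LOESS curve for each type and keep the fitted values.
fits <- lapply(split(toy, toy$type), function(d) {
  fit <- loess(imdb_score ~ release_year, data = d)
  data.frame(
    release_year = d$release_year,
    type = d$type,
    smoothed = predict(fit)
  )
})
trend <- do.call(rbind, fits)

print(head(trend))
```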
In this analysis of Netflix’s content library, we examined IMDb scores and vote counts, the distribution of age certifications, and the evolution of content quality over time. The figures above summarize these patterns separately for movies and shows, which may be useful to content creators and strategists.
Our investigation highlighted how different content types and age certifications are received by diverse audiences. For instance, the age certification analysis provided insights into Netflix’s target demographics, offering valuable information for advertisers and content creators. The trend analysis of IMDb scores over the years indicated shifts in content quality and audience expectations, reflecting evolving industry standards and viewer tastes.
Overall, our study presents a snapshot of Netflix’s dynamic content strategy, offering critical insights for stakeholders in the digital entertainment industry. As the platform continues to expand and diversify, such data-driven analyses are crucial in understanding and adapting to the changing landscape of viewer preferences.